County-level Data for 2016 Presidential Elections

Introduction

This is an analysis of US presidential elections data for 2016 at the county level. Since only a small percentage of votes went to independent candidates, we will only compare Democrat and Republican voteshare.

The data for this analysis is taken from https://github.com/tonmcg/County_Level_Election_Results_12-16.

Data import and checking

Library imports:

library(dplyr)
library(ggplot2)
library(knitr)
library(readr)

Read in the data:

df <- read_csv("2016_US_County_Level_Presidential_Results.csv")
kable(head(df))

X1	votes_dem	votes_gop	total_votes	per_dem	per_gop	diff	per_point_diff	state_abbr	county_name	combined_fips
0	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2013
1	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2016
2	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2020
3	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2050
4	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2060
5	93003	130413	246588	0.3771595	0.52887	37410	15.17%	AK	Alaska	2068

There are 3,141 rows in total, matching the number of counties in the US (Source: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/).

The dataset contains the following columns:

names(df)

##  [1] "X1"             "votes_dem"      "votes_gop"      "total_votes"   
##  [5] "per_dem"        "per_gop"        "diff"           "per_point_diff"
##  [9] "state_abbr"     "county_name"    "combined_fips"

per_dem and per_gop refer to the percentage of votes going to Democrats and Republicans respectively.
diff represents the absolute difference between Republican votes - Democrat votes.
per_point_diff represents this difference as a percentage of total votes.
combined_fips is a 5-digit code identifying the county. (From Wikipedia: The FIPS county code is a five-digit Federal Information Processing Standards (FIPS) code (FIPS 6-4) which uniquely identifies counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states.)

combined_fips is the column that we are going to use to combine our elections data with our mapping data.

Since we are interested in whether a given county had more Republican or Democrat votes, we have to recompute the diff and per_point_diff columns. diff and per_point_diff will be positive if there are more Republican votes than Democrat votes (and vice versa).

df <- df %>% mutate(diff = votes_gop - votes_dem,
                    per_point_diff = diff / total_votes * 100)

Summary statistics

Compute percentage of popular vote won by each party:

paste0("Republican % of popular vote: ", 
       round(sum(df$votes_gop) / sum(df$total_votes) * 100, digits = 1),
       "%")

## [1] "Republican % of popular vote: 47.3%"

paste0("Democrat % of popular vote: ", 
       round(sum(df$votes_dem) / sum(df$total_votes) * 100, digits = 1),
       "%")

## [1] "Democrat % of popular vote: 47.5%"

Although Clinton lost the presidential election, she actually won the popular vote!

Compute number of counties won by each party:

df %>% transmute(gop_won = votes_gop > votes_dem) %>%
    summarize(gop_won = sum(gop_won))

## # A tibble: 1 x 1
##   gop_won
##     <int>
## 1    2654

Painting a completely different picture, Trump won 2654 out of 3141 counties (or 84.5% of all counties). Clinton only won 487 counties. This suggests that Clinton won in counties with large populations, or that the margin of victory was slimmer in the counties that Trump won compared with the counties that Clinton won.

Histograms

We have Clinton winning the popular vote on one hand, but Trump winning many more counties. How can we reconcile these two facts?

One theory is that Clinton won her counties by a huge margin percentage-wise, while Trump won his counties by a slim margin percentage-wise. To test this theory, we could plot a histogram of the per_point_diff:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = per_point_diff)) + 
    labs(title = "Histogram of % vote margin", 
         x = "% Republicans won by", y = "Frequency")

The chart does not support the theory that Trump had narrower margins of victory in the counties that he won: he won a sizeable number of counties with > 50% vote difference.

Let’s try plotting a histogram of diff to look at absolute differences instead:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = diff)) + 
    labs(title = "Histogram of absolute vote margin", 
         x = "No. of votes Republicans won by", y = "Frequency")

This chart is very different! In the counties that Clinton won, she won it by extremely large margins in terms of absolute votes. Thus, even though she won very few counties compared to Trump, these large margins meant that she could actually win the popular vote.

The code below shows that the top 45 counties with largest absolute vote difference were all won by Clinton (number 46 was Montgomery, TX, which went to Trump).

df %>% select(State = state_abbr, County = county_name, diff) %>%
    mutate(abs_diff = abs(diff)) %>%
    arrange(desc(abs_diff)) %>%
    select(State, County, `Vote difference` = diff) %>%
    head(n = 50) %>%
    kable()

State	County	Vote difference
CA	Los Angeles County	-1273485
IL	Cook County	-1088369
NY	Kings County	-461433
WA	King County	-459368
NY	New York County	-456546
PA	Philadelphia County	-455124
CA	Alameda County	-395162
CA	Santa Clara County	-346020
NY	Queens County	-334839
MA	Middlesex County	-292756
FL	Miami-Dade County	-289340
MI	Wayne County	-288934
FL	Broward County	-288435
MD	Prince George’s County	-284337
NY	Bronx County	-283979
CA	San Francisco County	-277950
DC	District of Columbia	-248670
MN	Hennepin County	-237515
MD	Montgomery County	-226776
OR	Multnomah County	-208699
OH	Cuyahoga County	-204080
TX	Dallas County	-196980
VA	Fairfax County	-196648
GA	DeKalb County	-191600
MA	Suffolk County	-191170
CA	Contra Costa County	-180839
CA	San Diego County	-180436
TX	Travis County	-179725
GA	Fulton County	-171503
NJ	Essex County	-168972
CA	San Mateo County	-165849
WI	Milwaukee County	-162895
TX	Harris County	-161511
MD	Baltimore city	-155836
WI	Dane County	-146236
OH	Franklin County	-143633
NC	Mecklenburg County	-137955
FL	Orange County	-134488
CO	Denver County	-130974
NY	Westchester County	-124027
CA	Sacramento County	-110744
LA	Orleans Parish	-109566
MN	Ramsey County	-106151
PA	Allegheny County	-105529
NC	Wake County	-104746
TX	Montgomery County	104444
NJ	Hudson County	-104365
FL	Palm Beach County	-100649
MA	Norfolk County	-99958
TN	Shelby County	-91692

Optional material: An alternative to histograms

We didn’t go through this section in class as it uses ggplot in a slightly more involved way. Take a look at the code and see if you can figure out what’s going on.

In the chart below, each vertical bar represents one county.

temp <- df %>% arrange(desc(per_point_diff))
ggplot(data = temp) + 
    geom_col(mapping = aes(x = as.integer(row.names(temp)), 
                           y = per_point_diff,
                           fill = per_point_diff)) +
    scale_fill_gradient2(low = "blue", high = "red") + 
    labs(title = "% Difference in votes received by county", 
         x = "", y = "% Difference") + 
    theme(legend.position = "right")

The chart above does not support the theory that Trump had narrower margins of victory in the counties that he won, as compared to Clinton.

Let’s make the same chart, but with absolute difference in votes received (instead of percentage difference):

temp <- df %>% 
    select(diff) %>%
    arrange(desc(diff))
ggplot(data = temp) + 
    geom_col(mapping = aes(x = as.integer(row.names(temp)), 
                           y = diff,
                           fill = diff)) +
    scale_fill_gradient2(low = "blue", high = "red") + 
    labs(title = "Absolute difference in votes received by county", 
         x = "", y = "Absolute difference")

Conclusion

When analyzing elections, we have to examine the data from many different perspectives in order to get the full story.